Processor Shadowing: Maximizing Expected Throughput in Fault-Tolerant Systems

Authors

  • John L. Bruno
  • Edward G. Coffman
  • Jeffrey C. Lagarias
  • Tom J. Richardson
  • Peter W. Shor
Abstract

This paper studies parallel processing as a device for increasing fault tolerance. In the first of two basic models, a single job with a given running time is to be run on a finite set of processors; each processor is subject to failure but only while running a job. If a job is running on only one processor, and that processor fails, then the job must be restarted on another processor, assuming not all processors have already failed. To avoid such losses in accrued running time when at least two processors are available, it can be decided at any time to run the job synchronously on two processors in parallel, a replication technique we call shadowing. Clearly, shadowing has its own downside: while two processors are running, the failure rate is doubled. We show how to resolve this trade-off optimally; we devise a policy that schedules shadowing in such a way as to maximize the probability that the job finishes before all processors fail. We prove that the policy is of threshold type. That is, depending on the number of processors and the duration of the job, there is an optimal time to begin shadowing; once started, shadowing continues so long as neither processor fails and the job does not complete. We also show that the thresholds are monotone in the number of processors, i.e., if more processors are initially available, then shadowing should be started sooner. In the second of our two models, we have the same set-up except that we have an unbounded number of jobs, each having the same running time, and the objective is to maximize the expected number of jobs completed before all processors fail. We show that the optimal policy is again of threshold type, but that the thresholds are, surprisingly, not monotone in the number of processors. The optimal thresholds have a curious oscillatory behavior that we study in detail. Variants of the above problems are also analyzed using the same methods; several other variants are left as interesting open problems.
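
The first model can be explored numerically with a small Monte Carlo sketch. This is only an illustration under stated assumptions, not the paper's method: it assumes each running processor fails after an exponentially distributed time with rate lam (the abstract says only that the failure rate doubles under shadowing), and the function names job_completes and estimate, as well as the thresholds mapping (live processor count to the accrued running time at which shadowing starts), are hypothetical. The threshold values passed in are placeholders, not the optimal thresholds derived in the paper.

```python
import math
import random


def job_completes(n, T, lam, thresholds, rng):
    """Simulate one job of length T on n processors under a threshold policy.

    Assumption: failure times are exponential with rate lam per running
    processor. thresholds[k] is the accrued running time at which shadowing
    starts when k processors are alive; a missing entry means never shadow.
    """
    progress = 0.0
    while n > 0:
        start = thresholds.get(n, math.inf) if n >= 2 else math.inf
        if progress < start:
            # Solo phase: run on one processor until shadowing starts or the job ends.
            target = min(start, T)
            if rng.expovariate(lam) < target - progress:
                n -= 1            # the only running processor failed
                progress = 0.0    # accrued running time is lost
            else:
                progress = target
        else:
            # Shadowing phase: two processors run, so the first failure
            # arrives at rate 2 * lam.
            first_failure = rng.expovariate(2 * lam)
            if first_failure < T - progress:
                n -= 1                     # one copy failed
                progress += first_failure  # the survivor keeps the accrued time
            else:
                progress = T
        if progress >= T:
            return True
    return False


def estimate(n, T, lam, thresholds, trials=20000, seed=0):
    """Estimate the probability that the job finishes before all processors fail."""
    rng = random.Random(seed)
    wins = sum(job_completes(n, T, lam, thresholds, rng) for _ in range(trials))
    return wins / trials
```

For example, estimate(3, 1.0, 1.0, {2: 0.5, 3: 0.3}) estimates the completion probability for one arbitrary choice of thresholds; sweeping the threshold values is one way to observe the trade-off the abstract describes. The simulation relies on the memoryless property of the assumed exponential distribution, which is why a fresh failure time can be drawn at each phase change.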

Similar resources

Real-time Fault-tolerant Scheduling Algorithm for Distributed Computing Systems

This article proposes a Distributed Realtime Fault-tolerant model, priority Real-time Fault-tolerant algorithm and computational architecture of Distributed Real-time Fault-tolerant. According to this model, the problem of how to schedule a weighted Directed Acyclic Graph (DAG) in Distributed computing system for high reliability can be solved in the presence of multiprocessors faults. When som...

A Formal Description of FTAG for Multi-Processor Systems

FTAG is a functional model for writing fault-tolerant software that is based on attribute grammars. With this approach, a program is written as a series of module decompositions, with provisions for redoing and replicating modules used to implement fault-tolerance requirements. The functional nature of the model and the independence of decompositions makes FTAG especially well-suited for implemen...

A fault tolerant NoC architecture using quad-spare mesh topology and dynamic reconfiguration

Network-on-Chip (NoC) is widely used as a communication scheme in modern many-core systems. To guarantee the reliability of communication, effective fault tolerant techniques are critical for an NoC. In this paper, a novel fault tolerant architecture employing redundant routers is proposed to maintain the functionality of a network in the presence of failures. This architecture consists of a me...

Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems

Some applications are critical and must be designed as fault-tolerant systems. A voting algorithm is usually one of the principal elements of a fault-tolerant system. Two kinds of voting algorithm are used in most applications: the majority voting algorithm and the weighted average algorithm. These algorithms have some problems. Majority confronts with the problem of threshold limits and voter of weight...

Ditto Processor

Concentration of design effort for current single-chip Commercial-Off-The-Shelf (COTS) microprocessors has been directed towards performance. Reliability has not been the primary focus. As supply voltage scales to accommodate technology scaling and to lower power consumption, transient errors are more likely to be introduced. The basic idea behind any error tolerance scheme involves some type o...

Journal title:
  • Math. Oper. Res.

Volume 24

Publication date: 1999